HTML Page Analysis Based on Visual Cues

Authors

  • Yudong Yang
  • HongJiang Zhang
Abstract

In this paper, we present a novel approach to automatically analyzing the semantic structure of HTML pages by detecting visual similarities among content objects on web pages. The approach is based on the observation that, in most web pages, the layout styles of subtitles or records in the same content category are consistent, and there are apparent separation boundaries between different categories. Such subtitles should therefore look similar when rendered in a visual browser, and different categories should be clearly separable. In our approach, we first measure the visual similarities of HTML content objects. We then apply a pattern detection algorithm to find frequent patterns of visual similarity and use a number of heuristics to choose the most likely patterns. Finally, by grouping items according to these patterns, we build a hierarchical representation (tree) of the HTML document whose semantics are inferred from “visual consistency”. Preliminary experimental results on real web pages show promising performance.
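The three steps named in the abstract (measure visual similarity, detect a repeating pattern, group items into a tree) can be illustrated with a small sketch. This is not the authors' algorithm: the `VisualObject` feature set, the exact-match threshold, and the "earliest repeated style is the separator" heuristic are all assumptions chosen to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualObject:
    """A content object with a few rendered attributes (hypothetical feature set)."""
    text: str
    font_size: int
    bold: bool
    color: str


def similarity(a, b):
    """Fraction of the visual attributes that match between two objects."""
    checks = [a.font_size == b.font_size, a.bold == b.bold, a.color == b.color]
    return sum(checks) / len(checks)


def label_objects(objects, threshold=1.0):
    """Give the same integer label to objects that look alike."""
    representatives = []  # first object seen for each label
    labels = []
    for obj in objects:
        for label, rep in enumerate(representatives):
            if similarity(obj, rep) >= threshold:
                labels.append(label)
                break
        else:
            # No existing style is similar enough, so start a new label.
            representatives.append(obj)
            labels.append(len(representatives) - 1)
    return labels


def pick_separator(labels):
    """Assumed heuristic: the earliest label that occurs at least twice."""
    counts = Counter(labels)
    for label in labels:
        if counts[label] >= 2:
            return label
    return None


def build_tree(objects):
    """Group objects under each occurrence of the separator style."""
    labels = label_objects(objects)
    separator = pick_separator(labels)
    tree = []
    for obj, label in zip(objects, labels):
        if label == separator:
            tree.append({"head": obj.text, "children": []})
        elif tree:
            tree[-1]["children"].append(obj.text)
        else:
            # Content that appears before the first separator.
            tree.append({"head": None, "children": [obj.text]})
    return tree


page = [
    VisualObject("News", 18, True, "black"),
    VisualObject("Story A", 12, False, "black"),
    VisualObject("Story B", 12, False, "black"),
    VisualObject("Sports", 18, True, "black"),
    VisualObject("Match report", 12, False, "black"),
]
tree = build_tree(page)
```

Here the two bold 18-point items share a label and repeat, so they are treated as subtitles, and the plain 12-point items are grouped under the nearest preceding subtitle. A real system would take its attributes from a rendering engine and would need a more tolerant similarity measure and richer pattern heuristics.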


Related resources

Clustering of Web Pages based on Visual Similarity

Finding appropriate information on the web is tedious. The data needs to be organized by classifying it into categories. This categorization of web pages can be achieved by clustering. The clustering is done by analyzing the content of the HTML page and extracting keywords. Based on the extracted keywords, the page is evaluated and clustered. But the visual feature ...


Visual Adjacency Multigraphs – a Novel Approach for a Web Page Classification

Standard techniques for web page classification usually take a simple text-based approach, in which most of the information provided by the visual layout of a page is discarded. In our work we propose a new classification approach based on visual layout analysis, conducted before applying standard classification techniques. A page is represented as a hierarchical structure – Visual Ad...


VIPS: A VIsion based Page Segmentation Algorithm

A new web content structure analysis based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction and automatic page adaptation can benefit from this structure. This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure ...


Content Based Web Sampling

Web characterization methods have been studied for many years. Most of these methods focus on text-based web contents. Some of them analyze the contents of a web page by analyzing its HTML code, hyperlinks, and/or DOM structure. A web page is seldom characterized based on its visual appearance. A good reason for also considering the visual appearance of a web page is that humans initially...


Combining Textual and Visual Cues for Content-based Image Retrieval on the World Wide Web

A system is proposed that combines textual and visual statistics in a single index vector for content-based search of a WWW image database. Textual statistics are captured in vector form using latent semantic indexing (LSI) based on text in the containing HTML document. Visual statistics are captured in vector form using color and orientation histograms. By using an integrated approach, it beco...




Publication date: 2001